Abstract. We study supervised learning and generalisation in coupled perceptrons 
trained on-line using two learning scenarios. In the first scenario the teacher and 
the student are independent networks and both are represented by an Ashkin-Teller 
perceptron. In the second scenario the student and the teacher are simple perceptrons 
but are coupled by an Ashkin-Teller type four-neuron interaction term. Expressions 
for the generalisation error and the learning curves are derived for various learning 
algorithms. The analytic results find excellent confirmation in numerical simulations. 
1. Introduction 

One of the more interesting properties of neural networks is their ability to learn from 
examples. In on-line learning processes a student network updates its couplings after 
the presentation of each example in order to make its outputs agree with the outputs 
of the teacher. In the standard situation the student knows only the inputs and the 
corresponding outputs of the teacher and has no further knowledge of the rule used 
by the latter. Furthermore, in the course of learning the student becomes able to 
classify correctly new examples that it has never seen before. This property is 
called generalisation. 

Various aspects of learning and generalisation in neural networks have been 
intensively studied in many different contexts. For about a decade now statistical 
mechanical methods have been used successfully in these studies (for recent reviews 
see, for example |, |, |]). 

A lot of the theoretical research has been concentrated on the simplest models, 
such as the binary perceptron. Parallel to the progress in these investigations, new more 
realistic models have been considered, e.g., models with multi-state neurons |p, models 
with multi-neuron interactions 0, 0], models with many layers (see, e.g, p, ^ |I0|). 

In this paper we study on-line learning and generalisation in a recently introduced 
model, allowing two different types of binary neurons at each site, possibly having 
different functions [11, 12]. More specifically, this so-called Ashkin-Teller (AT) 
perceptron contains, besides two-neuron interaction terms, also a four-neuron interaction 
term. For the underlying biological motivation for the introduction of different types 
of neurons we refer to these references. Here, we recall that the maximal capacity of the AT 
perceptron (model II introduced in [11, 12]) can be larger than the one of the standard 
binary perceptron and that the corresponding recurrent network model can be a 
more efficient associative memory than a sum of two Hopfield models [13]. A natural 
question is then how this AT perceptron performs in on-line learning and generalisation 
tasks. 

Two learning scenarios turn out to be of interest. In the first scenario where the 
student and the teacher are independent AT perceptrons, we show that the resulting 
learning curves do not differ very much from the already known ones for perceptrons 
with multi-state neurons. For some particular values of the network parameters we 
precisely reproduce the learning curve of the 4-state Potts perceptron. 

In the second scenario both the student and the teacher are represented by a simple 
perceptron but they are coupled by an AT type four-neuron interaction term. Hence, 
contrary to the standard setup, they are not independent. This can be considered as 
a sort of "hardware" coupling. As a result, also the teacher mapping is changing in 
the process of learning. We obtain a set of learning curves which qualitatively differ 
from those found in the independent setup. We also find different asymptotic behaviour 
when the number of examples increases to infinity. For certain values of the network 
parameters such a coupling describes the realistic situation that the rule used by the 
teacher is partially shared by the student. 

The rest of the paper is organised as follows. In section 2 the model and the learning 
scenarios are introduced. The formulas for the generalisation error are derived in section 
3. The differential equations for the evolution of the order parameters are obtained in 
section 4. Their solutions, compared with numerical simulations, can be found in section 
5. In section 6 some concluding remarks are presented. Finally, two appendices contain 
some technical details of the derivations. 

2. The model and the learning scenarios 

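The AT perceptron is defined as a mapping of the binary ($\pm 1$) inputs $\{s_i, \sigma_i\}$, $i = 1, \ldots, N$, 
into two binary ($\pm 1$) outputs $s$ and $\sigma$: 

$$s = \mathrm{sgn}(h_1) + \Theta(\gamma_3|h_3| - \gamma_1|h_1|)\,\Theta(\gamma_2|h_2| - \gamma_1|h_1|)\,\bigl(\mathrm{sgn}(h_2 h_3) - \mathrm{sgn}(h_1)\bigr) \qquad (1)$$
$$\sigma = \mathrm{sgn}(h_2) + \Theta(\gamma_3|h_3| - \gamma_2|h_2|)\,\Theta(\gamma_1|h_1| - \gamma_2|h_2|)\,\bigl(\mathrm{sgn}(h_1 h_3) - \mathrm{sgn}(h_2)\bigr) , \qquad (2)$$

where $\Theta$ is the Heaviside step function and $\gamma_r > 0$, $r = 1, 2, 3$, denote the strength of 
the local fields $h_r$, which are defined as follows 

$$h_1 = \frac{1}{n_1}\sum_j J_j^{(1)} s_j , \quad h_2 = \frac{1}{n_2}\sum_j J_j^{(2)} \sigma_j , \quad h_3 = \frac{1}{n_3}\sum_j J_j^{(3)} s_j \sigma_j , \quad n_r^2 = \sum_j \bigl(J_j^{(r)}\bigr)^2 . \qquad (3)$$

The mapping (1)-(2) can be equivalently represented by the set of three equations (cfr. 
model I in [12]) 

$$s = \mathrm{sgn}(\gamma_1 h_1 + \sigma \gamma_3 h_3) \qquad (4)$$
$$\sigma = \mathrm{sgn}(\gamma_2 h_2 + s \gamma_3 h_3) \qquad (5)$$
$$s\sigma = \mathrm{sgn}(\sigma \gamma_1 h_1 + s \gamma_2 h_2) . \qquad (6)$$

For $\gamma_3 = 0$ the outputs $s$ and $\sigma$ are completely independent and defined like in the 
simple perceptron 

$$s = \mathrm{sgn}(h_1) \qquad (7)$$
$$\sigma = \mathrm{sgn}(h_2) . \qquad (8)$$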
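
The equivalence between the explicit mapping (1)-(2) and the self-consistency equations (4)-(6) can be checked numerically. The following is a minimal sketch, not part of the original derivation: the function names (`at_outputs`, `is_self_consistent`), the convention sgn(0) = +1 and the choice of strengths $\gamma_r$ are assumptions made only for illustration.

```python
import random

def sgn(x):
    """Sign function; sgn(0) = +1 is an arbitrary convention (ties have zero probability)."""
    return 1 if x >= 0 else -1

def theta(x):
    """Heaviside step function."""
    return 1 if x > 0 else 0

def at_outputs(h1, h2, h3, g1=1.0, g2=1.0, g3=1.0):
    """Outputs (s, sigma) of the AT perceptron following the mapping (1)-(2)."""
    a1, a2, a3 = g1 * abs(h1), g2 * abs(h2), g3 * abs(h3)
    s = sgn(h1) + theta(a3 - a1) * theta(a2 - a1) * (sgn(h2 * h3) - sgn(h1))
    sigma = sgn(h2) + theta(a3 - a2) * theta(a1 - a2) * (sgn(h1 * h3) - sgn(h2))
    return s, sigma

def is_self_consistent(h1, h2, h3, g1, g2, g3):
    """Check that the outputs of (1)-(2) also satisfy equations (4)-(6)."""
    s, sigma = at_outputs(h1, h2, h3, g1, g2, g3)
    return (s == sgn(g1 * h1 + sigma * g3 * h3)
            and sigma == sgn(g2 * h2 + s * g3 * h3)
            and s * sigma == sgn(sigma * g1 * h1 + s * g2 * h2))

# Random Gaussian fields; the strengths 1.0, 0.7, 1.3 are arbitrary test values.
rng = random.Random(0)
samples = [(rng.gauss(0, 1), rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(10000)]
all_consistent = all(is_self_consistent(h1, h2, h3, 1.0, 0.7, 1.3)
                     for h1, h2, h3 in samples)
```

For `g3 = 0` the function reduces to the two independent simple perceptrons of equations (7)-(8).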

2.1. Learning scenario I 

First, we consider the standard situation where the student and the teacher are two 
completely independent networks. In our case they are represented by AT perceptrons, 
meaning that the outputs of the teacher $\{s_T, \sigma_T\}$ and of the student $\{s_S, \sigma_S\}$ are both 
determined by the mapping (1)-(2) but with different couplings: $\mathbf{J}_r^T$ and $\mathbf{J}_r^S$ respectively, 
with $\mathbf{J}_r = \{J_j^{(r)}\}$. Initially, the student and the teacher couplings are not correlated. 
At each time step $t$ an example is presented to the student. The student network then 
updates its couplings according to the following learning rule $F$ 

$$\mathbf{J}_1^S(t+1) = \mathbf{J}_1^S(t) + \frac{1}{N} F\, s_T(t)\, \mathbf{s}(t) \qquad (9)$$

$$\mathbf{J}_2^S(t+1) = \mathbf{J}_2^S(t) + \frac{1}{N} F\, \sigma_T(t)\, \boldsymbol{\sigma}(t) \qquad (10)$$

$$\mathbf{J}_3^S(t+1) = \mathbf{J}_3^S(t) + \frac{1}{N} F\, s_T(t)\sigma_T(t)\, \boldsymbol{\psi}(t) \qquad (11)$$

where 

$$\mathbf{s} = \{s_i\}, \quad \boldsymbol{\sigma} = \{\sigma_i\}, \quad \boldsymbol{\psi} = \{s_i \sigma_i\} . \qquad (12)$$

In this scenario we consider only Hebbian learning for which F = 1. Furthermore, 
examples are chosen randomly with equal probability out of the complete set of 
examples. 
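
As an illustration of the update rule (9)-(11) with Hebbian learning ($F = 1$), the sketch below performs one on-line step for the three student coupling vectors. The function names and the list-based representation are assumptions chosen for readability; the teacher outputs `s_T` and `sigma_T` are taken as given.

```python
def hebbian_step(J, x, teacher_output):
    """One Hebbian update (F = 1): J <- J + (1/N) * teacher_output * x."""
    N = len(x)
    return [Jj + teacher_output * xj / N for Jj, xj in zip(J, x)]

def scenario_one_step(J1, J2, J3, s, sigma, s_T, sigma_T):
    """Update all three student coupling vectors for one example, cf. (9)-(11)."""
    psi = [si * gi for si, gi in zip(s, sigma)]  # psi_i = s_i * sigma_i, cf. (12)
    return (hebbian_step(J1, s, s_T),
            hebbian_step(J2, sigma, sigma_T),
            hebbian_step(J3, psi, s_T * sigma_T))

# Tiny example with N = 2 inputs and teacher outputs s_T = +1, sigma_T = -1.
new_J1, new_J2, new_J3 = scenario_one_step(
    [0.0, 0.0], [0.0, 0.0], [0.0, 0.0],
    s=[1, -1], sigma=[-1, -1], s_T=1, sigma_T=-1)
```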

2.2. Learning scenario II 

Alternatively, the AT perceptron can also be seen as two coupled perceptrons, with 
outputs $s$ and $\sigma$. In the second scenario we precisely analyse learning between such 
coupled perceptrons (or branches of the AT perceptron). The outputs of the student $s$ 
and the teacher $\sigma$ are defined by the equations (4) and (5) respectively. 

When $h_3 > 0$, the teacher and the student use two different mixtures of two 
perceptron mappings defined by the couplings $\mathbf{J}_1$ and $\mathbf{J}_2$. It implies that $s$ and $\sigma$ are 
always equal to $\mathrm{sgn}(h_1)$ or $\mathrm{sgn}(h_2)$ and sometimes, depending on the relation between 
$\gamma_1 h_1$, $\gamma_2 h_2$ and $\gamma_3 h_3$, $s = \sigma$. In the limit $\gamma_3 \to \infty$ the student and the teacher network 
become so strongly coupled that one always has $s = \sigma$ and the mapping (1)-(2) can be 
simplified to 

$$s = \sigma = \mathrm{sgn}(h) , \qquad h = \{h_x : |h_x| > |h_y| ;\; x, y = 1, 2\} . \qquad (13)$$

For $h_3 < 0$, the situation is quite different. Even with $\mathbf{J}_1 = \mathbf{J}_2$ there is always a non-zero 
fraction of disagreements between the student and the teacher, as long as $\gamma_3 > 0$. In the 
limit $\gamma_3 \to \infty$ the student always disagrees with the teacher, and the mapping (1)-(2) 
can be written in the form: 

$$s = -\sigma = \mathrm{sgn}(h_1) \quad \text{if } |h_1| > |h_2| , \qquad s = -\sigma = -\mathrm{sgn}(h_2) \quad \text{if } |h_1| < |h_2| . \qquad (14)$$

For any value of the coupling field $h_3$ and $\gamma_3 = 0$ the student and the teacher are 
independent and they use the mappings defined by only one coupling vector (cfr. (7)-(8)). 

In the sequel we take $\mathbf{s} = \boldsymbol{\sigma}$ because the student and the teacher must have the 
same inputs. We remark that this implies that $h_3 = \sum_j J_j^{(3)} / n_3$ (cfr. (3)). Again, at 
each time step $t$ an example is presented to the student network and its coupling vector 
$\mathbf{J}_1$ is updated as follows 

$$\mathbf{J}_1(t+1) = \mathbf{J}_1(t) + \frac{1}{N} F(\gamma_1 h_1, \gamma_3 h_3, s, \sigma)\, \sigma(t)\, \mathbf{s}(t) . \qquad (15)$$






Furthermore, at each time step a new coupling vector $\mathbf{J}_3$ is generated, thus making the 
coupling between the perceptrons random. The coupling vector of the teacher, $\mathbf{J}_2$, is not 
changed in the process of learning, but later on we average over all possible teachers. 
In this scenario we consider three learning rules $F$: 

Hebbian: $F(\gamma_1 h_1, \gamma_3 h_3, s, \sigma) = 1$ 

Perceptron: $F(\gamma_1 h_1, \gamma_3 h_3, s, \sigma) = \Theta(-s\sigma)$ 

Adatron: $F(\gamma_1 h_1, \gamma_3 h_3, s, \sigma) = -(\sigma \gamma_1 h_1 + \gamma_3 h_3)\,\Theta(-s\sigma)$ 
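
The three learning rules and the update (15) can be written compactly in code. This is a hedged sketch: the function names and argument order are assumptions, and the arguments `g1h1` and `g3h3` stand for the products $\gamma_1 h_1$ and $\gamma_3 h_3$.

```python
def theta(x):
    """Heaviside step function."""
    return 1.0 if x > 0 else 0.0

def F_hebbian(g1h1, g3h3, s, sigma):
    """Hebbian rule: always update with unit strength."""
    return 1.0

def F_perceptron(g1h1, g3h3, s, sigma):
    """Perceptron rule: update only when student and teacher disagree."""
    return theta(-s * sigma)

def F_adatron(g1h1, g3h3, s, sigma):
    """Adatron rule: on disagreement, update in proportion to the student's total field."""
    return -(sigma * g1h1 + g3h3) * theta(-s * sigma)

def update_student(J1, x, sigma, F_value):
    """Update (15): J1 <- J1 + (1/N) * F * sigma * x, with x the common input vector."""
    N = len(x)
    return [Jj + F_value * sigma * xj / N for Jj, xj in zip(J1, x)]

# Example: gamma_1 h_1 = 0.8, gamma_3 h_3 = 0.5 and teacher output sigma = -1
# give the student output s = sgn(0.8 - 0.5) = +1, i.e. a disagreement.
f_adatron = F_adatron(0.8, 0.5, 1, -1)
updated = update_student([0.0, 0.0], [1, -1], -1, F_perceptron(0.8, 0.5, 1, -1))
```

Whenever the student output is consistent with equation (4) and $s \neq \sigma$, the Adatron value is positive, so the update always moves the student towards the teacher output.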



3. Generalisation error 



A quantity of interest in the sequel is the generalisation error. It is defined as the 
probability that the student and the teacher disagree, i.e. that their outputs are 
different. When the teacher and the student are simple independent perceptrons 
the generalisation error $\varepsilon_g = \arccos(\rho)/\pi$ is a simple function of the overlap $\rho = 
\mathbf{J}^T \cdot \mathbf{J}^S / (n^T n^S)$ between the student and the teacher couplings, which in this case plays 
the role of an order parameter. Unfortunately, for more complicated models this relation 
takes a much more involved form (see, e.g., ||^). 
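
For simple independent perceptrons the relation $\varepsilon_g = \arccos(\rho)/\pi$ can be checked by sampling correlated Gaussian local fields. The sketch below is an illustration only; the sample size, the seed and the function names are assumptions.

```python
import math
import random

def generalisation_error(rho):
    """Generalisation error of a simple perceptron pair with overlap rho."""
    return math.acos(rho) / math.pi

def estimate_disagreement(rho, n_samples=100000, seed=1):
    """Monte Carlo estimate of the probability that sgn(h_T) != sgn(h_S)
    for unit-variance Gaussian fields with correlation rho."""
    rng = random.Random(seed)
    c = math.sqrt(1.0 - rho * rho)
    disagreements = 0
    for _ in range(n_samples):
        h_T = rng.gauss(0, 1)
        h_S = rho * h_T + c * rng.gauss(0, 1)  # correlated with h_T by construction
        if h_T * h_S < 0:
            disagreements += 1
    return disagreements / n_samples

estimate = estimate_disagreement(0.5)
exact = generalisation_error(0.5)  # arccos(0.5)/pi = 1/3
```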



3.1. Scenario I 

In the first scenario the definition of the generalisation error reads 

$$\varepsilon_g(\rho_1, \rho_2, \rho_3) = \left\langle 1 - \tfrac{1}{4}(1 + s_T s_S)(1 + \sigma_T \sigma_S) \right\rangle_I , \qquad (16)$$

with the overlaps $\rho_r$ defined by 

$$\rho_r = \frac{\mathbf{J}_r^T \cdot \mathbf{J}_r^S}{n_r^T n_r^S} \qquad (17)$$

and with $\langle \ldots \rangle_I = \int d\mathbf{h}^T d\mathbf{h}^S \ldots P_I(\mathbf{h}^T, \mathbf{h}^S)$ denoting the average over the teacher field, 
$\mathbf{h}^T = \{h_1^T, h_2^T, h_3^T\}$, and the student field, $\mathbf{h}^S = \{h_1^S, h_2^S, h_3^S\}$, which have a joint 
probability distribution $P_I(\mathbf{h}^T, \mathbf{h}^S)$. The averages over these fields are double averages, 
one over the examples and one over the couplings. This arises because the couplings 
and the examples enter the mapping (1)-(2) and the learning rules only through the 
local fields. We assume that the examples are taken randomly with equal probability 
out of the full training set. Then, in the thermodynamic limit the local fields become 
correlated Gaussian variables and the joint probability distribution $P_I(\mathbf{h}^T, \mathbf{h}^S)$ can be 
written down in the form 

$$P_I(\mathbf{h}^T, \mathbf{h}^S) = \frac{\bigl[(1-\rho_1^2)(1-\rho_2^2)(1-\rho_3^2)\bigr]^{-1/2}}{(2\pi)^3} \exp\left\{ -\frac{1}{2} \sum_{r=1}^{3} \frac{(h_r^T)^2 + (h_r^S)^2 - 2\rho_r h_r^T h_r^S}{1-\rho_r^2} \right\} . \qquad (18)$$

The generalisation error then becomes 

$$\varepsilon_g(\rho_1, \rho_2, \rho_3) = \frac{3}{4} - \sum_{r=1}^{3} I_r \qquad (19)$$

with 



1 r^^T J Prh^ 



D/iTerf 



r 



2 Jo ' \V^{T^) 
1 
4 



T 



+ \ I D(/i^, /if)(a+, - a-,)«" - «r~r") + «' + «r"r')«r" + a" „)sgn(/i^/if ) , (20) 



«4. = ^1 - erf f ) - ' Dftjerf f ^vl"?! ± ^VP-'"? ) , (21) 

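where $Dz = dz\, \exp(-z^2/2)/\sqrt{2\pi}$ is the Gaussian measure, $r', r'' = 1, 2, 3$ ($r \neq r' \neq r'' \neq r$) 
and where 

$$D(h_r^T, h_r^S) = \frac{dh_r^T\, dh_r^S}{2\pi\sqrt{1-\rho_r^2}} \exp\left\{ -\frac{(h_r^T)^2 + (h_r^S)^2 - 2\rho_r h_r^T h_r^S}{2(1-\rho_r^2)} \right\} \qquad (22)$$

is a correlated Gaussian. 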
3.2. Scenario II 

In the second scenario the generalisation error is given by 

$$\varepsilon_g(\rho) = \left\langle 1 - \tfrac{1}{2}(1 + s\sigma) \right\rangle_{II} = \int d\mathbf{h}\, P_{II}(\mathbf{h}) \left(1 - \tfrac{1}{2}(1 + s\sigma)\right) , \qquad (23)$$

with the overlap $\rho$ defined by 

$$\rho = \frac{\mathbf{J}_1 \cdot \mathbf{J}_2}{n_1 n_2} . \qquad (24)$$

Here again, as in the first scenario, the average over the examples and the couplings is 
done through averaging over the local fields. The examples are chosen randomly with 
equal probability out of the full set of examples. In the thermodynamic limit this leads 
to a Gaussian distribution of the local fields. Since the behaviour of the system strongly 
depends on the sign of the coupling field $h_3$ we consider three different field distributions 
$P_{II}$: 

$$P_\pm(\mathbf{h}) = \bigl((2\pi)^3 (1-\rho^2)\bigr)^{-1/2} \exp\left\{ -\frac{h_1^2 + h_2^2 - 2 h_1 h_2 \rho}{2(1-\rho^2)} - \frac{h_3^2}{2} \right\} \qquad (25)$$

$$P_+(\mathbf{h}) = 2 P_\pm(\mathbf{h})\, \Theta(h_3) \qquad (26)$$

$$P_-(\mathbf{h}) = 2 P_\pm(\mathbf{h})\, \Theta(-h_3) . \qquad (27)$$

In the case of the distribution $P_\pm$ the components of the vector $\mathbf{J}_3$ are taken randomly 
(with equal probability) from some interval $(-a, a)$, with $a$ a positive real number. In 
the case of the distributions $P_+$ and $P_-$ these components are chosen in the same way 
but those values which lead to negative, respectively positive, values of the field $h_3$ are 
omitted. The generalisation error in these three situations reads, with obvious notation, 

$$\varepsilon_g^c(\rho) = \frac{1}{\pi} \arccos(\rho) + I_c , \qquad c = \pm, +, - \qquad (28)$$

where 

$$I_\pm = \tfrac{1}{2}\left(u_{12}^- - u_{12}^+ + u_{21}^- - u_{21}^+\right) , \qquad I_+ = -u_{12}^+ - u_{21}^+ , \qquad I_- = u_{12}^- + u_{21}^- \qquad (29)$$

and 

-•-/:-(-(5i))(-(7#i))' - 

It is easy to see that only for positive $h_3$ (i.e. for $P_{II} = P_+$) the generalisation error 
$\varepsilon_g^+(\rho)$ goes to zero as $\rho$ goes to 1. It is also equal to zero for any $\rho$ when $P_{II} = P_+$ and 
$\gamma_3 = \infty$. 
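
The qualitative difference between the three field distributions can be illustrated by a Monte Carlo estimate of the disagreement probability. The sketch below is not the analytic expression (28); it samples the fields directly, uses the explicit mapping (1)-(2) with $\gamma_1 = \gamma_2 = 1$, and all names, the seed and the sample size are assumptions. The value $\rho = 0.99$ is used instead of $\rho = 1$ to avoid exact ties between $|h_1|$ and $|h_2|$.

```python
import math
import random

def sgn(x):
    return 1 if x >= 0 else -1

def at_outputs(h1, h2, h3, g3):
    """Student s and teacher sigma from the mapping (1)-(2) with gamma_1 = gamma_2 = 1."""
    a1, a2, a3 = abs(h1), abs(h2), g3 * abs(h3)
    s = sgn(h2 * h3) if (a3 > a1 and a2 > a1) else sgn(h1)
    sigma = sgn(h1 * h3) if (a3 > a2 and a1 > a2) else sgn(h2)
    return s, sigma

def scenario_two_error(rho, g3, field="pm", n_samples=50000, seed=2):
    """Estimate the disagreement probability for P_pm ("pm"), P_+ ("plus") or P_- ("minus")."""
    rng = random.Random(seed)
    c = math.sqrt(1.0 - rho * rho)
    errors = 0
    for _ in range(n_samples):
        h1 = rng.gauss(0, 1)
        h2 = rho * h1 + c * rng.gauss(0, 1)
        h3 = rng.gauss(0, 1)
        if field == "plus":
            h3 = abs(h3)       # P_+ keeps only positive coupling fields
        elif field == "minus":
            h3 = -abs(h3)      # P_- keeps only negative coupling fields
        s, sigma = at_outputs(h1, h2, h3, g3)
        if s != sigma:
            errors += 1
    return errors / n_samples

e_plus = scenario_two_error(0.99, 1.0, "plus")
e_pm = scenario_two_error(0.99, 1.0, "pm")
e_minus = scenario_two_error(0.99, 1.0, "minus")
e_decoupled = scenario_two_error(0.5, 0.0, "pm")
```

With these settings the estimate for $P_+$ is small while the estimates for $P_\pm$ and $P_-$ remain of order one even at large overlap, in line with the statement above. For $\gamma_3 = 0$ the estimate reduces to $\arccos(\rho)/\pi$.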



4. Order parameters and their evolution 



As can be seen from the formulas written down in the last section, the generalisation 
error is a function of the overlaps $\rho$ or $\rho_r$, which play the role of order parameters in 
the learning process. Their evolution is coupled with the evolution of the norms of the 
couplings $n_r$ and in the thermodynamic limit $N \to \infty$ it can be described by ordinary 
differential equations. 

In the first scenario a standard calculation (for a review see, e.g., [0]) leads to the 
following result for Hebbian learning 



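$$\frac{dn_r}{d\alpha} = \langle \xi_r h_r^S \rangle_I + \frac{1}{2n_r} , \qquad \frac{d\rho_r}{d\alpha} = \frac{1}{n_r} \langle \xi_r (h_r^T - \rho_r h_r^S) \rangle_I - \frac{\rho_r}{2n_r^2} , \qquad r = 1, 2, 3 \qquad (31)$$

where $\xi_1 = s_T$, $\xi_2 = \sigma_T$, $\xi_3 = s_T \sigma_T$ and $\alpha = t/N$ is the number of examples scaled 
with the size of the system. It becomes continuous in the thermodynamic limit. After 
performing the averages we arrive at 

$$\frac{dn_r}{d\alpha} = \rho_r b_r + \frac{1}{2n_r} , \qquad \frac{d\rho_r}{d\alpha} = \frac{1-\rho_r^2}{n_r} b_r - \frac{\rho_r}{2n_r^2} \qquad (32)$$

with the quantity $b_r$ given by 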



+ 



1 



1-2 



1 + 



7r 
Ir' 



1 - 2 
2 



m erf 



Pr 

2nl 



(32) 



Dh erf 



h'-fr" 



^r''\/2Cr"r 



For $\gamma_1 = \gamma_2 = \gamma_3$ this quantity simplifies to 



1 arctan 

71 



1 

V2 



— = 1.21635. 



(33) 
(34) 

(35) 
We remark that the differential equations (32) for a given $r$ have the same form as 
the differential equations found for the simple perceptron with Hebbian learning 0. 
More specifically, they differ only by the value of the coefficient $b_r$, which for the simple 
perceptron is equal to $\sqrt{2/\pi} = 0.798$. 

For the Hebbian learning we are considering, it is possible to construct a simple 
expression for $\rho_r$ as a function of $\alpha$. Following Opper and Kinzel [1] we slightly modify 
the update rule (9), (10), (11) (substituting $1/N$ by $1/\sqrt{N}$) and easily arrive at 

$$\rho_r = \sqrt{\frac{\alpha}{\alpha + a_r}} \qquad (36)$$

where we have taken as initial condition $\rho(0) = 0$ and where 

$$a_r = \frac{1}{b_r^2} . \qquad (37)$$

This expression differs from the solution of (32) only for small values of $\alpha$ and has the 
advantage of having a simple form. The evolution of $\rho$ in the case of simple perceptrons 
is described by the single equation (36), but with a coefficient $a = \pi/2$. Since these 
results are very similar to the results obtained for the simple perceptron we do not test 
other algorithms in this scenario because we expect that also in those cases a strong 
resemblance to the simple perceptron occurs. 
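
To illustrate how the closed-form overlap compares with the differential equations, the sketch below integrates Hebbian evolution equations of the standard form $dn/d\alpha = \rho b + 1/(2n)$, $d\rho/d\alpha = (1-\rho^2) b/n - \rho/(2n^2)$ and compares the result with $\rho(\alpha) = [1 + 1/(b^2\alpha)]^{-1/2}$. The initial norm $n(0) = 1$, the step size and the midpoint integrator are assumptions made for this example; the coefficient $b = \sqrt{2/\pi}$ is the simple-perceptron value quoted above.

```python
import math

def hebbian_rhs(n, rho, b):
    """Right-hand sides of the Hebbian evolution equations for norm n and overlap rho."""
    dn = rho * b + 1.0 / (2.0 * n)
    drho = (1.0 - rho * rho) * b / n - rho / (2.0 * n * n)
    return dn, drho

def integrate(b, alpha_max, n0=1.0, rho0=0.0, d_alpha=0.01):
    """Integrate from alpha = 0 to alpha_max with the explicit midpoint method."""
    n, rho = n0, rho0
    for _ in range(int(round(alpha_max / d_alpha))):
        dn1, dr1 = hebbian_rhs(n, rho, b)
        dn2, dr2 = hebbian_rhs(n + 0.5 * d_alpha * dn1, rho + 0.5 * d_alpha * dr1, b)
        n += d_alpha * dn2
        rho += d_alpha * dr2
    return n, rho

def closed_form_rho(alpha, b):
    """Closed-form overlap obtained when the student starts from zero couplings."""
    return 1.0 / math.sqrt(1.0 + 1.0 / (b * b * alpha))

b_simple = math.sqrt(2.0 / math.pi)
_, rho_ode = integrate(b_simple, 100.0)
rho_closed = closed_form_rho(100.0, b_simple)
eps_closed = math.acos(rho_closed) / math.pi  # close to 1/sqrt(2*pi*alpha) at large alpha
```

At $\alpha = 100$ the two overlaps agree to about four decimal places, while they differ visibly at small $\alpha$ because of the different initial conditions.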

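In the second scenario with the learning rule $F$ defined in subsection 2.2 we have 
to solve the following set of differential equations: 

$$\frac{dn_1}{d\alpha} = \langle h_1 \sigma F(\gamma_1 h_1, \gamma_3 h_3, s, \sigma) \rangle_{II} + \frac{1}{2n_1} \langle F^2(\gamma_1 h_1, \gamma_3 h_3, s, \sigma) \rangle_{II} \qquad (38)$$

$$\frac{d\rho}{d\alpha} = \frac{1}{n_1} \langle \sigma F(\gamma_1 h_1, \gamma_3 h_3, s, \sigma)(h_2 - \rho h_1) \rangle_{II} - \frac{\rho}{2n_1^2} \langle F^2(\gamma_1 h_1, \gamma_3 h_3, s, \sigma) \rangle_{II} . \qquad (39)$$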

Performing the averages leads to much more complicated expressions than those 
obtained in the first scenario. The explicit form of these expressions obtained for 
Hebbian, perceptron and Adatron learning with the distributions $P_\pm$ and $P_+$ can be 
found in Appendix A. 



5. Results 

In this section we discuss the numerical solutions of the differential equations (32) and 
(38)-(39) and compare them with the results of simulations. Because only the ratios of 
the strength parameters $\gamma_1$, $\gamma_2$ and $\gamma_3$ are important we take $\gamma_1 = \gamma_2 = 1$, and vary only 
$\gamma_3$. 



5.1. Scenario I 

The learning curves for small values of the number of examples a obtained in the first 
scenario using formula (^) are presented in figure |I]. All curves start with an initial 
generalisation error $\varepsilon_g = 0.75$ corresponding to random guessing in four-state models. 
For $\gamma_3 = 0$ learning between two independent perceptrons is described. For $\gamma_3 = 1$ the 
learning curve is identical with the one of the 4-state Potts perceptron (cfr. [11, 12]). 

In the limit $\alpha \to \infty$, $\varepsilon_g$ decays like $\alpha^{-1/2}$ for all values of $\gamma_3$, precisely like in the case of 
learning between simple perceptrons. 



5.2. Scenario II 

A careful analysis of expression (28) leads to the conclusion that in the second scenario 
the generalisation error can be nonzero even when the normalised angle between the 
student and the teacher couplings, $\phi = \arccos(\rho)/\pi$, is equal to zero. This happens 
when we allow the field $h_3$ to take negative values. Therefore, we follow the evolution 
of two dynamical variables in the sequel: the generalisation error $\varepsilon_g$ and the normalised 
angle between the student and the teacher $\phi$. For all the learning algorithms and 
distributions of the fields that we have considered, we observe an abrupt change in the 
asymptotic behaviour in $\alpha$ when $\gamma_3$ changes from 0 to some non-zero value. Logarithmic 
plots of the learning curves for two distributions of the fields, $P_\pm$ and $P_+$, are presented 
in figures 2-7. The learning curves for the distribution $P_-$ are qualitatively very similar 
to the curves obtained for $P_\pm$. 



5.2.1. $P_{II} = P_\pm$ Let us first analyse the results obtained for the distribution $P_\pm$ in 
more detail. For $\gamma_3 \neq 0$, the generalisation error saturates at some non-zero value. For 
Hebbian and perceptron learning the angle between the student and the teacher 
decreases asymptotically to zero at a higher rate than in the decoupled case $\gamma_3 = 0$. 
For Hebbian learning we find that in the limit a 00, (p ~ a~^, versus (p ~ for 
73 = 0, while in the case of the perceptron algorithm (p ~ a~^, versus ~ for 
$\gamma_3 = 0$. For the Adatron algorithm $\phi$ and $\varepsilon_g$ both saturate at some non-zero value. In 
spite of the fact that the generalisation error never vanishes, the student is able to learn 
the couplings of the teacher using the Hebbian or perceptron algorithm. 



5.2.2. $P_{II} = P_+$ We observe that for all algorithms the generalisation error goes 
asymptotically to zero. For Hebbian and perceptron learning it decreases faster than 
in the decoupled case. In the limit a —>■ 00, we get Eg ~ for Hebbian learning 
while Eg ~ for perceptron learning. For Adatron learning we obtain the same 
decay exponent as in the decoupled case. Surprisingly, for the perceptron and Adatron 
algorithms the decay of the angle between the student and the teacher, 0, is slower than 
in the decoupled case in the limit a — > 00. For the perceptron we have ~ and for 
the Adatron we find ~ a~^. On the contrary, for Hebbian learning ~ as for 
the decoupled case. 

Since an analytic analysis of the differential equations (see Appendix A) is rather 
involved, the asymptotic exponents discussed above have been determined numerically. 
Only in the case of Hebbian learning with the field distribution $P_+$ the numerical analysis 
was not entirely unambiguous. Therefore, we have derived the corresponding exponents 
analytically. Details can be found in Appendix B. 



The initial generalisation error is a function of the strength parameter $\gamma_3$, which 
measures the strength of the coupling between the two perceptrons. The larger $\gamma_3$, 
the more knowledge the student and the teacher have in common, and hence the smaller 
the initial error. For $\gamma_3 \to \infty$, the student and the teacher use precisely the same rule 
(13) in order to determine their outputs. 

Finally, the numerical solution of the equations (38)-(39) suggests that there is 
a simple relation between the decay exponents of $\phi$ and $\varepsilon_g$, denoted by $y_\phi$ and $y_\varepsilon$ 
respectively, 

$$y_\varepsilon = 2 y_\phi . \qquad (40)$$

This relation can also be derived analytically (see Appendix B). For $\gamma_1 = \gamma_2$ we find in 
the limit $\alpha \to \infty$ (and $\phi \to 0$) that 

,2 



TC 



^0 , (41) 



confirming the observation 



5.3. Computer simulations 

To check the analytic results described above we have performed numerical simulations. 
The system sizes have been varied between $N = 100$ and $N = 999$ neurons. An 
excellent agreement has been found for both scenarios and all learning algorithms, even 
for relatively small $N$. As a representative example we present a comparison between 
simulations and analytic results obtained in the second scenario with the Adatron 
algorithm for $\gamma_3 = 0.1$ and $P_{II} = P_\pm$. For the sake of clarity we show the results 
obtained for small and large $\alpha$ separately. The analytic results for small $\alpha$ are compared 
with simulations for a system with $N = 999$ neurons (fig. 8). For larger $\alpha$ we have 
made simulations for smaller systems ($N = 100$), which are displayed in fig. 9. In both 
cases only the results obtained for one sample are shown. 

For small $\alpha$ the simulation points lie smoothly along the theoretical curves. This 
points to the self-averaging property of the learning process. For larger values of $\alpha$ 
very strong fluctuations occur around the theoretical result. This happens only for 
the Adatron algorithm and $P_{II} = P_\pm$ and, hence, cannot be explained entirely by the 
relatively small size of the system. Indeed, as has been noticed in section 2, in this case 
there is always a non-zero fraction of disagreement between the student and the teacher. 
So, the strategy used by the Adatron algorithm, which updates the couplings in proportion 
to the error made by the student, must lead to rather large random changes. Nevertheless, 
the simulation points in fig. 9 are evenly distributed on both sides of the theoretical 
curve. 
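
A comparison between simulation and theory of the kind described above can be sketched for the simplest case, the decoupled limit $\gamma_3 = 0$ with Hebbian learning, where the closed-form overlap $\rho(\alpha) = [1 + \pi/(2\alpha)]^{-1/2}$ applies for a student starting from zero couplings. This is an illustrative example and not the simulation code used for the figures; the system size, seed, Gaussian teacher couplings and function names are assumptions.

```python
import math
import random

def simulate_hebbian(N=100, alpha_max=10.0, seed=3):
    """On-line Hebbian learning between two simple perceptrons (gamma_3 = 0).
    Returns the generalisation error arccos(rho)/pi after alpha_max * N examples."""
    rng = random.Random(seed)
    teacher = [rng.gauss(0, 1) for _ in range(N)]
    student = [0.0] * N
    for _ in range(int(alpha_max * N)):
        x = [rng.choice((-1, 1)) for _ in range(N)]
        field = sum(t * xi for t, xi in zip(teacher, x))
        label = 1 if field >= 0 else -1                  # teacher output
        student = [J + label * xi / N for J, xi in zip(student, x)]
    dot = sum(J * t for J, t in zip(student, teacher))
    norm_s = math.sqrt(sum(J * J for J in student))
    norm_t = math.sqrt(sum(t * t for t in teacher))
    return math.acos(dot / (norm_s * norm_t)) / math.pi

def theory_hebbian(alpha):
    """Generalisation error from the closed-form overlap for Hebbian learning."""
    rho = 1.0 / math.sqrt(1.0 + math.pi / (2.0 * alpha))
    return math.acos(rho) / math.pi

eps_sim = simulate_hebbian()
eps_theory = theory_hebbian(10.0)
```

A single sample at $N = 100$ fluctuates around the theoretical value, and the scatter is expected to shrink as $N$ grows.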
6. Conclusions 

In this paper we have studied on-line learning and generalisation using the AT 
perceptron. Two learning scenarios have been considered. The results obtained in 
the first scenario, where the student and the teacher are represented by independent 
AT perceptrons, are very similar to the results obtained for the simpler models f^. For 
a particular choice of the network parameters the learning curve precisely reproduces 
that found for the 4-state Potts perceptron p. 

In the second scenario the student and the teacher are taken to be simple 
perceptrons coupled by a four-neuron interaction term. Particular results depend 
crucially on the distribution of the couplings J3. 

For the field distribution $P_{II} = P_\pm$ the generalisation error always saturates at 
some non-zero value. This is not surprising since this distribution allows the field $h_3$ to 
take negative values, which inevitably leads to a non-vanishing fraction of disagreements 
between the student and the teacher even when $\mathbf{J}_1 = \mathbf{J}_2$ (cfr. (14)). In spite of this, 
for Hebbian and perceptron learning the student manages to learn the couplings of the 
teacher perfectly (in the limit $\alpha \to \infty$). This does not happen, however, for the Adatron 
algorithm, which in the standard (decoupled) situation proved to be the fastest 0. The 
reason is that this algorithm changes the couplings of the student proportionally to the 
error made by the latter. Since this error is non-zero even for Ji = J2, this cannot be 
a good strategy. Hence, the more "blind" updates (Hebbian and perceptron) appear to 
be more effective. 

For $P_{II} = P_+$ we have obtained quite different results. In this case the generalisation 
error goes to zero when $\rho$ goes to 1. For Hebbian and perceptron learning we observe 
faster decay of $\varepsilon_g$ than in the decoupled case. For Adatron learning the decay exponent 
of $\varepsilon_g$ is the same as for $\gamma_3 = 0$. Surprisingly, for all algorithms we find the same or 
slower decay of $\phi$ compared with the decoupled case. 

The best asymptotic decay of the generalisation error has been obtained for 
$P_{II} = P_+$ with the Adatron rule: $\varepsilon_g \sim 0.618\,\alpha^{-1}$. Comparing with the case of 
independent perceptrons we see that it is better than the lower bound for on-line learning 
($\varepsilon_g \sim 0.88\,\alpha^{-1}$) and worse than the Bayesian lower bound ($\varepsilon_g \sim 0.44\,\alpha^{-1}$). 

We remark that in the course of a learning process in the second scenario also the 
teacher mapping is changed but not the teacher couplings. This can be interpreted 
as a kind of effective mutual learning caused by the ("hardware") coupling of the two 
perceptrons. This is different from the mutual learning process analysed in |]T6|, 0, the 



only other learning process of this type known to us. There, in contrast to our setup, 
the teacher explicitly learns from the student. In our model the decay exponent of Eg is 
not influenced by a particular value of the strength parameter 73 as long as it is nonzero. 

The model analysed in the second scenario with $P_{II} = P_+$, where a part of the 
learning rule is shared by the teacher and the student, can be compared to a real-life 
situation in which both of them have, e.g., the same cultural background or followed the 
same education. One can expect that in such a situation the learning process is much 
more efficient since the student and the teacher speak, in a sense, the same language. It 
corresponds to a faster asymptotic decay of the generalisation error in our model. It 
would be interesting to see, e.g., whether an optimisation of the learning process [0 
would still improve these results. 
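Appendix A. The evolution of the order parameters in the second learning 
scenario 

The set of differential equations (38)-(39) for the order parameters in the second learning 
scenario can be written down in the following form: 

$$\frac{dn_1}{d\alpha} = f_1(\rho, \gamma_{13}, \gamma_{23}) + \frac{1}{2n_1} f_2(\rho, \gamma_{13}, \gamma_{23})$$

$$\frac{d\rho}{d\alpha} = \frac{1}{n_1} f_3(\rho, \gamma_{13}, \gamma_{23}) - \frac{\rho}{2n_1^2} f_2(\rho, \gamma_{13}, \gamma_{23})$$

with $\gamma_{rr'} = \gamma_r/\gamma_{r'}$ and where the explicit form of $f_1(\rho, \gamma_{13}, \gamma_{23})$, $f_2(\rho, \gamma_{13}, \gamma_{23})$ and 
$f_3(\rho, \gamma_{13}, \gamma_{23})$ depends on the algorithm used and on the distribution of the fields. 
In the case of the distribution $P_{II} = P_\pm$ we have for 

Hebbian learning 

$$f_1(\rho, \gamma_{13}, \gamma_{23}) = \rho f_{21} + g_{21}$$

$$f_2(\rho, \gamma_{13}, \gamma_{23}) = 1$$

$$f_3(\rho, \gamma_{13}, \gamma_{23}) = f_{21}(1 - \rho^2) - \rho g_{21}$$

Perceptron learning 

$$f_1(\rho, \gamma_{13}, \gamma_{23}) = \tfrac{1}{2}\left(\rho f_{21} - f_{12} + g_{21}\right)$$

$$f_2(\rho, \gamma_{13}, \gamma_{23}) = \frac{1}{\pi} \arccos(\rho) + I_\pm$$

$$f_3(\rho, \gamma_{13}, \gamma_{23}) = \tfrac{1}{2}\left(f_{21}(1 - \rho^2) - g_{12} - \rho g_{21}\right)$$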



Adatron learning 

/l (P, 713, 723) = - 71 f /a - /l^ + /l2 + ^ 



73 ^12 - Pt21 + ' 



27r 



^21 V '^21 , 



/2(p, 7i3, 723) = li [ fa - fu + /i2 + ^ j - 73 f ^ arcsin(p) - I± - ^ 
+ 7l73 ti2 - 2pt2i + 



71 



+ 



21 



'-21 



/3(P,7l3,723) = 7l 



(Vl-P' + P arcsin(p)) + (72^1 + ^?i2 + P {fa + A^i - /21) 



+ 73 t2l(l-p') + 



2, , v^r^ / 1 1 



27r 



+ 



P ^ P 



-12 



'-21 



with 



'2 _ r 

X 



D/i^. hj. ( 1 



2 _ / /l, (7rr' + P) 

W2(l-P' 



erf 



/ /ir (7rr' - p) 

V v/2(l - P^ 



1 /I -p2 / 2 

- — y 1 arctan 

hrr' V 27r V 7r 



7r3 



i/rr' 



i-p2 r 1 



27r I a2^, 



TT 



(l + 73VrV)""-l 



1 /l-p2/ 2 

\ I 1 arctan 

arr' V 27r 

1 



7r3 



fa = / D/ii /i?erf 



/iip 



00 



X 



D/io I 1 - erf 



2(l-p^) 



f± = - 



[ 723/^2 ^ 



° D/i, /i^ 1^1 + erf ^^'''^^ 



V2 



erf 



D/ii /i^ sinh(p/ii/i2) , 



/ K (7rr' ± P) 

W2(l-p2 



''rr' 



2vrCr3 V (7rr' ± p) 



+ 



Cr327r 



, b„i b^^, b^^i , 



± 1 , ^ \2 , (7rr' ±P)^ 
C^^, = 1 + (7^3) H ^ , a^.t 



1 + (7rr')^ - 27„/p 



P^ 



1 -p2 



'l + (7rr') +27 rr' P 



l-p2 



where J± is given by expression (^) and c^s is defined in expression (plj). 
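In the case of the distribution $P_{II} = P_+$ we have for 

Hebbian learning 

$$f_1(\rho, \gamma_{13}, \gamma_{23}) = \rho f_{21}^+ + g_{21}^+$$

$$f_2(\rho, \gamma_{13}, \gamma_{23}) = 1$$

$$f_3(\rho, \gamma_{13}, \gamma_{23}) = (1 - \rho^2) f_{21}^+ - \rho g_{21}^+$$

Perceptron learning 

$$f_1(\rho, \gamma_{13}, \gamma_{23}) = \tfrac{1}{2}\left(\rho f_{21}^+ - f_{12}^+ + g_{21}^+\right)$$

$$f_2(\rho, \gamma_{13}, \gamma_{23}) = \frac{1}{\pi} \arccos(\rho) + I_+$$

$$f_3(\rho, \gamma_{13}, \gamma_{23}) = \tfrac{1}{2}\left(f_{21}^+(1 - \rho^2) - g_{12}^+ - \rho g_{21}^+\right)$$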



Adatron learning 

/l(p,7l3 , 723) 



71 



/+ - 2/+ -ga + - 



+ 73 

/2(P, 713,723) =ll 

+ 7i73 

/3(P,713,723) = 7l 
+ 73 



1 . s + + 1 / 1 - P^ 

_(l_p)_2t+ +2pt+ / 



TT 



TT \ C. 



21 



/a^ - 2/1+2 -9a + 



— arcsin(p) — /+ 

TT 2 



2 . ^ 4- , 2 / 1 - p2 



7r 



TT \/ C 



21 



27273^ 



21 



^ (Vl - p2 + parcsin(p)) + p/+ + p(7a + 2 ((^^^a + ^^21 + P/2I) 



2t2i(l-p' 



1-p^ 



l-p2 /l-p2 
-+ P\ 



-12 



-21 



with 



J a 



phi 



a/2(1-P^) 



2(l-p^) 







X / D/i2(^l+erf(^^V"l-p2 
r-O 



-721 1^12! 



/r^ = \/ - + 2 / D/i /i ( 1 + erf 



(^)) 



1 + erf 



D/ii hi exp {— phih2) 

h {'Jrr' + p) 

v/2(l-p2) 



1 + erf 



/ 7lA^ 



fi'rr' 



2 /W!A_2^,,^/2ri 

67,^,/ V 27r V TT V 0,^.^./ 



Qrr' 



i-p2 1 



(1 + 73AV) 



277 52^, 

and where $I_+$ is given by expression (29). 



Appendix B. The asymptotic form of the solution in the second scenario 
for Hebbian learning with $P_{II} = P_+$ 

Because the dependence of the generalisation error on the overlap $\rho$ is rather 
complicated (see (28)) we derive the asymptotic form for $\varepsilon_g^+$ in two steps. First, we 
find the asymptotic relation between $\varepsilon_g^+$ and $\phi$ and then we determine the behaviour of 
$\phi$ as a function of $\alpha$ in the limit $\alpha \to \infty$. 

Appendix B.1. Asymptotic relation between $\varepsilon_g^+$ and $\phi$ 

The generalisation error is defined as (see (28)): 

$$\varepsilon_g^+ = \frac{1}{\pi} \arccos\rho - u_{12}^+ - u_{21}^+ = \phi - u_{12}^+ - u_{21}^+$$

with the integrals $u_{rr'}^+$ given by (30). We now expand these integrals as a function 
of $\phi$ for small values of $\phi$. First, we change the variables to get 

= 4> I — da:(l + erf(a0x))(l + erf(cx)) = / dx/((/),a;), 



where 



Ir Irr' + 1 

C 



73V^ ' TTV^ 

Expanding x) with respect to (p and taking 71 = 72 = 1 we get 

\J2txc 4c^vr 



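and this leads to 

$$\varepsilon_g^+ = \frac{\pi}{4\sqrt{2}\,\gamma_3} \phi^2 + O(\phi^3) . \qquad (B.1)$$

Appendix B.2. Asymptotic relation between $\phi$ and $\alpha$ 

The differential equations (38)-(39) can be written down in terms of the variables $n_1$ 
and $\phi$. For Hebbian learning and $P_{II} = P_+$ this gives 

$$\frac{dn_1}{d\alpha} = f_1(\cos(\pi\phi), \gamma_{13}, \gamma_{23}) + \frac{1}{2n_1} f_2(\cos(\pi\phi), \gamma_{13}, \gamma_{23}) \qquad (B.2)$$

$$\frac{d\phi}{d\alpha} = -\frac{f_3(\cos(\pi\phi), \gamma_{13}, \gamma_{23})}{n_1 \pi \sin(\pi\phi)} + \frac{\cos(\pi\phi)}{2\pi n_1^2 \sin(\pi\phi)} f_2(\cos(\pi\phi), \gamma_{13}, \gamma_{23}) . \qquad (B.3)$$

The functions $f_1(\cos(\pi\phi), \gamma_{13}, \gamma_{23})$, $f_2(\cos(\pi\phi), \gamma_{13}, \gamma_{23})$ and $f_3(\cos(\pi\phi), \gamma_{13}, \gamma_{23})$ are 
defined in Appendix A. Expanding the r.h.s. of the differential equations (B.2)-(B.3) 
around $\phi = 0$ up to the first non-vanishing term we can easily find that for $\gamma_1 = \gamma_2$ 

$$\phi^2 \simeq \frac{\sqrt{2}}{2\pi(\sqrt{2} - 1)}\, \alpha^{-1} .$$

Combining this result with (B.1) we obtain the asymptotic formula for the generalisation 
error: 

$$\varepsilon_g^+ = \frac{1}{8\gamma_3(\sqrt{2} - 1)}\, \alpha^{-1} + o(\alpha^{-3/2}) .$$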
Figure 1. Learning scenario I: the generalisation error $\varepsilon_g$ as a function of the number 
of examples $\alpha$ with $\gamma_1 = \gamma_2 = 1$ and $\gamma_3 = \infty, 1, 0$ from top to bottom. 

Figure 2. Learning scenario II: Log-log plot of the generalisation error, $\varepsilon_g$ (solid lines), 
and the normalised angle between the teacher and the student, $\phi$ (broken line), for 
$P_{II} = P_\pm$ and Hebbian learning as a function of the number of examples $\alpha$. Intermediate 
curves not marked on the figure are for $\gamma_3 = 0.01, 0.1, 1$. 

Figure 3. As in fig. 2 but for the perceptron algorithm. Intermediate curves not 
marked on the figure are for $\gamma_3 = 0.5, 1$. 

Figure 4. As in fig. 2 but for the Adatron algorithm. Intermediate curves not marked 
on the figure are for $\gamma_3 = 0.1, 0.5$. 

Figure 5. Learning scenario II: Log-log plot of $\varepsilon_g$ (solid lines) and $\phi$ (broken line), 
for $P_{II} = P_+$ and Hebbian learning as a function of $\alpha$. Intermediate curves not marked 
on the figure are for $\gamma_3 = 0.1, 1.0, 9.9$. 

Figure 6. As in fig. 5 but for the perceptron algorithm. 

Figure 8. Second learning scenario with Adatron learning and $\gamma_3 = 1$. Simulations 
(grey circles) with $N = 999$ versus theoretical results (solid black line) as a 
function of $\alpha$. 

Figure 9. As in fig. 8 with $N = 100$. 



